DCT-based scalable video compression

ABSTRACT

A method of encoding an input video signal for communication over a computer network, the method comprising the steps of: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation to coded macroblocks; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by encoding the motion information, encoding the significance map, encoding the signs of all significant coefficients, and encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.

[0001] This application claims the benefit of priority from U.S. Provisional Patent Application No. 60/242,938 filed Oct. 24, 2000.

TECHNICAL FIELD

[0002] This invention relates generally to multimedia communications and more particularly to methods and systems for the compression/decompression and encoding/decoding of video information transmitted over the Internet.

BACKGROUND

[0003] With the availability of high-performance personal computers and the popularity of broadband Internet connections, the demand for Internet-based video applications such as video conferencing, video messaging, video-on-demand, etc. is rapidly increasing. To reduce transmission and storage costs, improved bit-rate compression/decompression (“codec”) systems are needed. Image, video, and audio signals are amenable to compression due to considerable statistical redundancy in the signals. Within a single image or a single video frame, there exists significant correlation among neighboring samples, giving rise to what is generally termed “spatial correlation”. Also, in moving images, such as full motion video, there is significant correlation among samples in different segments of time such as successive frames. This correlation is generally referred to as “temporal correlation”. There is a need for an improved, cost-effective system and method that uses both spatial and temporal correlation to remove the redundancy in the video to achieve high compression in transmission and to maintain good to excellent image quality, while adapting to changes in the available bandwidth of the transmission channel and to the limitations of the receiving resources of the clients.

[0004] A known technique for taking advantage of the limited variation between frames of a motion video is motion-compensated image coding. In such coding, the current frame is predicted from the previously encoded frame using motion estimation and compensation, and only the difference between the actual current frame and the predicted current frame is coded. By coding only the difference, or residual, rather than the image frame itself, it is possible to improve image quality, for the residual tends to have lower amplitude than the image, and can thus be coded with greater accuracy. Motion estimation and compensation are discussed in Lim, J. S. Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 497-507 (1990). However, motion estimation and compensation techniques have high computational cost, prohibiting software-only applications for most personal computers.

[0005] Further difficulties arise in the provision of a codec for an Internet streamer in that the bandwidth of the transmission channel is subject to change during transmission, and clients with varying receiver resources may join or leave the network as well during transmission. Internet streaming applications require video encoding technologies with features such as low delay, low complexity, scalable representation, and error resilience for effective video communications. The current standards and the state-of-the-art video coding technologies are proving to be insufficient to provide these features. Some of the developed standards (MPEG-1, MPEG-2) target non-interactive streaming applications. Although the H.323 Recommendation targets interactive audiovisual conferencing over unreliable packet networks (such as the Internet), the applied H.26x video codecs do not support all the features demanded by Internet-based applications. Although new standards such as H.263+ and MPEG-4 have started to address some of these issues (scalability, error resilience, etc.), the current state of these standards is far from complete enough to support a wide range of video applications effectively.

[0006] Due to very heterogeneous networking and computing infrastructure, highly scalable video coding algorithms are required. A video codec should provide reasonable quality to low-performance personal computers connected via a dial-up modem or a wireless connection, and high quality to high-performance computers connected using T1. Thus the compression algorithm is expected to scale well in terms of both computational cost and bandwidth requirement.

[0007] The Real-time Transport Protocol (RTP) is most commonly used to carry time-sensitive multimedia traffic over the Internet. Since RTP is built on the unreliable User Datagram Protocol (UDP), the coding algorithm must be able to effectively handle packet losses. Furthermore, due to the low-delay requirements of interactive applications and multicast transmission requirements, the popular retransmission method widely deployed over the Internet cannot be used. Thus the video codec should provide a high degree of resilience against network and transmission errors in order to minimize impact on visual quality.

[0008] Computational complexity of the encoding and decoding process must be low in order to provide reasonable frame rate and quality on low-performance computers (PDAs, hand-held computers, etc.) and high frame rate and quality on average personal computers. As mentioned, the popularly applied motion estimation and motion compensation techniques have high computational cost, prohibiting software-only applications for most personal computers.

SUMMARY OF INVENTION

[0009] This invention provides a new method of video encoding by performing reorganization, significance mapping, and bitstream layering of coefficients derived from discrete transform operations on original frames in which motion has been detected. Moving areas of a frame are updated using intracoding or differential coding based upon the difference between a current portion of a frame and the same portion from a previous frame. The previous frame can be either the previous original frame or a previous reconstructed frame. In order to increase error resilience, part of the frame can be periodically updated (intracoded) in a distributed manner, regardless of whether or not motion is detected. Discrete cosine transform (DCT) is carried out in sub-portions of each of the coded frame portions. The resulting coefficients are reorganized into a three-level multiple-resolution representation. The coefficients are then quantized into significant (one) and insignificant (zero) values, which determine a significance map. The map is rearranged by order of significance into bitstream layers that are encoded using adaptive arithmetic coding. The coding scheme provides scalability, high coding performance, and error resilience while incurring low coding delays and computational complexity.

BRIEF DESCRIPTION OF DRAWINGS

[0010] In Figures which illustrate non-limiting embodiments of the invention:

[0011] FIG. 1 is a functional block diagram of the encoder of a codec operated in accordance with an embodiment of the invention.

[0012] FIG. 2 is a functional block diagram of the encoder of a codec operated in accordance with an alternative embodiment of the invention, involving frame reconstruction at the encoder.

[0013] FIG. 3 is a pictorial representation illustrating how portions of a frame are periodically updated (intracoded) in a distributed manner, in accordance with an embodiment of the invention.

[0014] FIG. 4 is a pictorial representation illustrating the reorganization of DCT coefficients into a multi-resolution structure.

[0015] FIG. 5 is a pictorial representation illustrating the spatial scalability provided by a codec according to the invention.

[0016] FIG. 6 is a pictorial representation illustrating the temporal scalability provided by a codec according to the invention.

[0017] FIG. 7 is a pictorial representation illustrating a probability model for coding/transmitting significance information.

[0018] FIG. 8 is a pictorial representation illustrating a probability model for coding/transmitting the sign and magnitude of significant DCT coefficients.

DESCRIPTION

[0019] Throughout the following description, specific details are set forth in order to provide a more thorough understanding of the invention. However, the invention may be practiced without these particulars. In other instances, well known elements have not been shown or described in detail to avoid unnecessarily obscuring the present invention. Accordingly, the specification and drawings are to be regarded in an illustrative, rather than a restrictive, sense.

[0020] Each of FIG. 1 and FIG. 2 depicts a functional block diagram of the encoder of a codec according to embodiments of the invention. In each case, the encoding process consists of the following steps:

[0021] Motion detection;

[0022] Coefficients reorganization;

[0023] Coefficients partitioning;

[0024] Bitstream layering; and

[0025] Application of adaptive arithmetic coding.

[0026] Each of these steps will be explained in detail below. This process can be implemented using software alone, if desired.

[0027] FIG. 1 represents a simple encoding architecture according to the invention. It uses the previous original frame as the reference. The encoder does not maintain the state of the decoder, which means that error may accumulate between the original and reconstructed frames. Although error accumulation lowers the signal-to-noise ratio of the frames, a measure traditionally used for comparing objective performance, research has determined that the impact on visual quality is minimal. The codec architecture illustrated in FIG. 1 provides high error resilience and low computational complexity.

[0028] FIG. 2 represents a more sophisticated encoding architecture according to the invention, including a step of frame reconstruction at the encoder itself. In FIG. 2, the previously reconstructed frame is used as the reference for motion detection. Thus the encoder maintains the state of the decoder, and the error accumulation of the previous structure is avoided, resulting in higher objective performance (higher signal-to-noise ratio). However, this comes at the cost of higher computational complexity of the encoder and higher error sensitivity of the bitstream.

[0029] Motion Detection

[0030] Motion detection is not to be confused with the computationally complex “motion estimation and compensation” technique described above. Motion detection does not involve predicting/estimating the current frame based on a previous frame and encoding only the difference between the prediction/estimation and the actual frame. Rather, motion detection according to the present invention simply detects actual differences between the current frame and a previous frame, offering high error resilience and low computational complexity.

[0031] Motion detection is based on conditional replenishment; that is, only moving areas of the frame are updated, using intracoding (an encoding process which does not use data from the reference frame) and/or differential coding (an encoding process where only the difference with the corresponding reference frame data is encoded). In one embodiment of the invention, each frame is divided into a two-dimensional array of macroblocks (MBs), each of size 16 pixels by 16 pixels, and this procedure is carried out for each MB separately. As illustrated in FIG. 1 and FIG. 2, motion detection may use either the previous original frame or the previous reconstructed frame as the reference.

[0032] When the previous original frame is used as reference, as in the embodiment illustrated in FIG. 1, each MB may have two states. If the difference between the current MB and the previous original MB is larger than a single given threshold, then the MB is coded, preferably by intracoding. Otherwise, the MB is not coded at all and the decoder replaces the pixels within the MB with those from the previous frame.

[0033] When the previous reconstructed frame is used as reference, as in the embodiment illustrated in FIG. 2, each MB may have three states. Two thresholds T₁ and T₂ (T₁<T₂) are defined. If the difference between the current MB and the previous reconstructed MB is smaller than T₁, then the MB is not coded. If the difference is larger than T₁ but smaller than T₂, then preferably the difference between the current MB and the previously reconstructed MB is coded (that is, by differential coding). Finally, if the difference is larger than T₂, then the MB is coded, preferably by intracoding.
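The two-state and three-state decisions described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the "difference" between macroblocks is measured, so mean absolute pixel difference is assumed here, and the names `mb_difference` and `classify_mb`, the threshold values, and the handling of exact ties are hypothetical.

```python
# Hypothetical macroblock classifier. The difference measure (mean absolute
# difference), the thresholds, and tie handling are assumptions, not from the text.
NOT_CODED = "not coded"
DIFF_CODED = "differentially coded"
INTRA_CODED = "intracoded"


def mb_difference(cur, ref):
    """Mean absolute pixel difference between two equally sized blocks."""
    total = sum(abs(c - r)
                for cur_row, ref_row in zip(cur, ref)
                for c, r in zip(cur_row, ref_row))
    return total / (len(cur) * len(cur[0]))


def classify_mb(cur, ref, t1, t2=None):
    """Two states when t2 is None (previous original frame as reference),
    three states otherwise (previous reconstructed frame as reference)."""
    d = mb_difference(cur, ref)
    if t2 is None:
        return INTRA_CODED if d > t1 else NOT_CODED
    if d < t1:
        return NOT_CODED
    if d < t2:          # a difference exactly equal to t1 lands here (assumed)
        return DIFF_CODED
    return INTRA_CODED  # a difference exactly equal to t2 lands here (assumed)


# Three 16x16 luminance macroblocks: unchanged, slightly changed, heavily changed.
flat = [[100] * 16 for _ in range(16)]
slight = [[103] * 16 for _ in range(16)]
moved = [[160] * 16 for _ in range(16)]

for cur in (flat, slight, moved):
    print(classify_mb(cur, flat, t1=2.0, t2=20.0))
```

With a single threshold (`t2` omitted) the same function reproduces the two-state behaviour of the FIG. 1 architecture.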

[0034] In order to increase error resilience, part of the frame may be periodically updated (intracoded) in a distributed manner, as illustrated in FIG. 3. Based on extensive experiments over the Internet, 10% of the frame is preferably intracoded regardless of motion; that is, the entire frame is fully intracoded over ten frames. This forced intraupdate results in significant performance gain in lossy networks such as the Internet compared to the traditional I-frame refresh applied in standard video codecs such as H.263 and MPEG-2.
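A minimal sketch of such a distributed refresh schedule follows. The actual pattern is the one shown in FIG. 3, which the text does not spell out; the interleaved pattern below (every tenth macroblock, with the starting offset rotating from frame to frame) and the function name `forced_intra_mbs` are assumptions chosen only to show that roughly 10% of the macroblocks per frame cover the whole frame in ten frames.

```python
def forced_intra_mbs(frame_index, num_mbs, period=10):
    """Indices of macroblocks forced to intracode in this frame.

    Assumed pattern: macroblock i is refreshed when i mod period equals
    the frame number mod period, so refreshed blocks are spread over the frame.
    """
    phase = frame_index % period
    return [i for i in range(num_mbs) if i % period == phase]


# A 176x144 frame holds 11 x 9 = 99 macroblocks of 16x16 pixels.
NUM_MBS = (176 // 16) * (144 // 16)

refreshed = set()
for frame in range(10):
    mbs = forced_intra_mbs(frame, NUM_MBS)
    refreshed.update(mbs)
    print(frame, len(mbs))

# After ten consecutive frames every macroblock has been refreshed once.
print(refreshed == set(range(NUM_MBS)))
```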

[0035] DCT Coefficients Reorganization

[0036] After motion detection, each MB is preferably further partitioned into four nonoverlapping blocks each of size 8 pixels by 8 pixels. If the MB is not coded, then all the coefficients are replaced with zero coefficients. Otherwise, discrete cosine transform (DCT) is carried out on each of the blocks. After DCT, coefficients are reorganized into a three-level multi-resolution representation such as that shown in FIG. 4. Coefficients reorganization provides two advantages:

[0037] First, it increases error resilience significantly. The effects of packet losses will be uniformly distributed over the entire image instead of being concentrated into specific image regions, providing less disturbing visual artifacts. Second, spatial scalability can be efficiently supported, as shown in FIG. 5.
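The block transform and reorganization can be sketched as below. The exact layout of FIG. 4 is not given in the text, so the mapping used here is an assumption: coefficient (u, v) of every 8×8 block is gathered into one contiguous band, which places the lowest-frequency coefficients of all blocks in the top-left corner of the reorganized frame. The function names `dct_2d` and `reorganize` are hypothetical, and the DCT is a direct, unoptimized orthonormal DCT-II.

```python
import math

N = 8  # block size used in the text


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def reorganize(coeffs, height, width):
    """Move coefficient (u, v) of block (bi, bj) to row u*bh + bi, column v*bw + bj.

    coeffs holds the block DCTs in place; this layout is an assumed stand-in
    for the structure of FIG. 4.
    """
    bh, bw = height // N, width // N
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            bi, u = divmod(r, N)
            bj, v = divmod(c, N)
            out[u * bh + bi][v * bw + bj] = coeffs[r][c]
    return out


# One 16x16 macroblock whose four 8x8 blocks are flat with values 10, 20, 30, 40.
SIZE = 16
frame = [[10 * (2 * (r // N) + (c // N) + 1) for c in range(SIZE)] for r in range(SIZE)]
coeffs = [[0.0] * SIZE for _ in range(SIZE)]
for bi in range(SIZE // N):
    for bj in range(SIZE // N):
        block = [row[bj * N:(bj + 1) * N] for row in frame[bi * N:(bi + 1) * N]]
        d = dct_2d(block)
        for u in range(N):
            for v in range(N):
                coeffs[bi * N + u][bj * N + v] = d[u][v]

bands = reorganize(coeffs, SIZE, SIZE)
# The four DC coefficients now sit together in the top-left 2x2 corner.
print([[round(bands[i][j], 6) for j in range(2)] for i in range(2)])
```

Under this assumed layout, a decoder that reads only the top-left quarter or half of each dimension of the reorganized frame obtains the low-frequency coefficients of every block, which is one way the spatial scalability described above could be realized.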

[0038] DCT Coefficients Partitioning

[0039] After DCT transformation and data reorganization, coefficients are quantized with a uniform scalar quantizer. DCT coefficients that are quantized to nonzero are termed significant coefficients. DCT coefficients that are quantized to zero are insignificant coefficients. Thus the quantization procedure determines a map of significant coefficients which can be described as a significance map. The significance map is a binary image of size equal to that of the original image. Binary “0” means that the DCT coefficient at the corresponding location is insignificant, and binary “1” means that the DCT coefficient at the corresponding location is significant.
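The quantization and significance-map step can be illustrated with a short sketch. The step size and the rounding rule are not stated in the text; truncation toward zero and a step of 8 are assumed here, and the function names are hypothetical.

```python
def quantize(coeffs, step):
    """Uniform scalar quantizer; truncation toward zero is an assumption."""
    return [[int(c / step) for c in row] for row in coeffs]


def significance_map(quantized):
    """1 where the quantized coefficient is nonzero, 0 elsewhere."""
    return [[1 if q != 0 else 0 for q in row] for row in quantized]


coeffs = [[80.0, -12.5, 3.0],
          [7.9, -8.0, 0.0]]
quantized = quantize(coeffs, step=8)
sig_map = significance_map(quantized)

print(quantized)  # [[10, -1, 0], [0, -1, 0]]
print(sig_map)    # [[1, 1, 0], [0, 1, 0]]
```

The signs and magnitudes of the nonzero entries of `quantized` are what the later coding stages transmit for the positions marked 1 in the map.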

[0040] Bitstream Layering

[0041] One of the key advantages of a codec according to this invention is its scalability in multiple dimensions. It provides three levels of temporal scalability and three levels of spatial scalability. FIG. 6 illustrates three levels of temporal scalability. In FIG. 6, “Q” means quarter resolution, “H” means half resolution, “F” means full resolution, and “GOF” means group of frames.

[0042] Take, for example, a video sequence encoded at 30 frames/second. The receiver is able to decode the video at 30 frames/second, 15 frames/second, or 7.5 frames/second. A codec according to the present invention also supports three levels of spatial scalability, as shown in FIG. 5. This means that if the original video is encoded at 352×288 pixel resolution, then it can be decoded at full spatial resolution (352×288 pixels), half spatial resolution (176×144 pixels), or quarter spatial resolution (88×72 pixels) as well.
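The arithmetic behind these operating points can be sketched as follows. The text gives the resulting frame rates and resolutions but not how frames are assigned to temporal layers; the dyadic assignment in `temporal_layer` (every fourth frame in the base layer) is an assumption, and both function names are hypothetical.

```python
def operating_point(full_fps, full_w, full_h, temporal_level, spatial_level):
    """Frame rate and size at a given level, 0 (lowest) to 2 (full).

    Each level below full halves the frame rate or each spatial dimension.
    """
    t = 2 ** (2 - temporal_level)
    s = 2 ** (2 - spatial_level)
    return full_fps / t, full_w // s, full_h // s


def temporal_layer(frame_index):
    """Assumed dyadic layering: 0 for every 4th frame, 1 for the other even frames, 2 for odd frames."""
    if frame_index % 4 == 0:
        return 0
    if frame_index % 2 == 0:
        return 1
    return 2


for level in range(3):
    print(operating_point(30, 352, 288, level, level))

# Frames a receiver would decode from the first eight frames at each temporal level.
for level in range(3):
    print([i for i in range(8) if temporal_layer(i) <= level])
```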

[0043] In addition to temporal and spatial scalability, SNR scalability is also supported by transmitting significant coefficients in bit-plane order starting from the most significant bit.

[0044] Adaptive Arithmetic Coding

[0045] Adaptive arithmetic coding is a four-stage procedure. First, the motion information is encoded. Second, the significance map is encoded. Third, the signs of all significant coefficients are encoded. Finally, magnitudes of significant coefficients are encoded.

[0046] The motion information is encoded by using adaptive arithmetic coding. If the previous original frame is used as the reference, only two symbols are needed (“not coded” and “intracoded”). If the previous reconstructed frame is used as reference, then three symbols are usually needed (“not coded”, “differentially coded”, and “intracoded”).

[0047] The significance information of the DCT coefficients of the coded MBs needs to be encoded. The frame is scanned in “subband” order from coarse to fine resolution. Within each subband, the coefficients are scanned from top to bottom, left to right. It is observed that a large percentage (about 80%) of the DCT coefficients are insignificant. Binary adaptive arithmetic coding is used to encode the significance information of DCT coefficients. SIG and INSIG symbols denote significant and insignificant coefficients, respectively. During adaptive arithmetic coding, each pixel may be coded assuming a different probability model (each a “context”). In one embodiment, the context of each pixel is based on the significance status of its four neighboring pixels in its causal neighborhood as shown in FIG. 7. In FIG. 7, the causal neighborhood of a pixel is the set of four pixels (three pixels in the previous row and one pixel in the previous column) which are already encoded or decoded. Since the encoder and decoder must use the same probability model during the coding process, the probability model of the current pixel can be based only on the knowledge of the already transmitted pixels. In this embodiment of the invention, a total of five probability models are used (0 significant, 1 significant, . . . , 4 significant).
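The context selection described above can be sketched as follows. This is not an arithmetic coder: it only shows how one of five adaptive models is chosen from the four causal neighbours, and it estimates the ideal code length that an arithmetic coder driven by those models would approach. The frequency-count model with one initial count per symbol, the treatment of positions outside the frame as insignificant, the single raster scan (rather than the subband-by-subband scan), and all names are assumptions.

```python
import math


def significance_context(sig, r, c):
    """Number of significant causal neighbours (0..4): three in the previous
    row and the left neighbour. Positions outside the map count as 0 (assumed)."""
    def at(i, j):
        return sig[i][j] if 0 <= i < len(sig) and 0 <= j < len(sig[0]) else 0

    return at(r - 1, c - 1) + at(r - 1, c) + at(r - 1, c + 1) + at(r, c - 1)


class AdaptiveBinaryModel:
    """Frequency-count probability estimate, one count per symbol at start (assumed)."""

    def __init__(self):
        self.counts = [1, 1]

    def cost_and_update(self, bit):
        """Ideal code length of `bit` in bits, then adapt the counts."""
        p = self.counts[bit] / sum(self.counts)
        self.counts[bit] += 1
        return -math.log2(p)


def estimate_bits(sig):
    """Scan top to bottom, left to right, with one model per context."""
    models = [AdaptiveBinaryModel() for _ in range(5)]
    bits = 0.0
    for r in range(len(sig)):
        for c in range(len(sig[0])):
            ctx = significance_context(sig, r, c)
            bits += models[ctx].cost_and_update(sig[r][c])
    return bits


sig = [[1, 1, 0],
       [0, 1, 0]]
print(significance_context(sig, 1, 1))  # 2: two significant neighbours in the row above

all_insignificant = [[0] * 4 for _ in range(4)]
print(estimate_bits(all_insignificant))  # about 4.09 bits for 16 symbols
```

Because the decoder sees the same previously decoded symbols, it can rebuild the same contexts and counts without any side information.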

[0048] The sign of significant DCT coefficients is also encoded using binary adaptive arithmetic coding. POS and NEG symbols denote positive and negative significant DCT coefficients, respectively. In this case, the context is determined by the number of significant coefficients in a small spatial neighborhood shown in FIG. 8. This context is calculated from the already transmitted significance map. In this embodiment of the invention, a total of five models are used (0 significant, 1 significant, . . . , 4 significant).

[0049] Magnitudes of significant DCT coefficients are encoded in bit-plane order. Again, a binary adaptive arithmetic codec with two symbols is used to encode each bit-plane. As with the sign of significant coefficients, the context is determined by the number of significant DCT coefficients in a small neighborhood such as that shown in FIG. 8. The transmission order proceeds from the most significant bit-plane to the least significant bit-plane, providing SNR scalability.
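The bit-plane ordering and the SNR scalability it gives can be sketched as follows. The sketch leaves out the arithmetic coding of each plane and shows only the ordering; the function names are hypothetical, and filling missing low-order planes with zeros at the decoder is an assumption.

```python
def bit_planes(magnitudes):
    """Bit-planes of a non-empty list of positive integer magnitudes, most significant first."""
    top = max(magnitudes).bit_length()
    return [[(m >> b) & 1 for m in magnitudes] for b in range(top - 1, -1, -1)]


def from_planes(planes, total_planes):
    """Rebuild magnitudes from the planes received so far.

    Planes not yet received are taken as zero (assumed decoder behaviour).
    """
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        shift = total_planes - 1 - k
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values


magnitudes = [10, 1, 1, 5]          # magnitudes of the significant coefficients
planes = bit_planes(magnitudes)
print(planes)                        # [[1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 1, 1]]

# A receiver that stops after two planes gets a coarser approximation.
print(from_planes(planes[:2], len(planes)))  # [8, 0, 0, 4]
print(from_planes(planes, len(planes)))      # [10, 1, 1, 5]
```

Each additional plane refines every coefficient by one bit, so truncating the stream after any plane still yields a usable, lower-quality reconstruction.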

[0050] In respect of the embodiment of the invention illustrated in FIG. 2, involving frame reconstruction, an inverse discrete cosine transform is carried out on the significance maps for the purpose of reconstructing each frame at the encoder.

[0051] A video codec according to the present invention provides the following highly desirable features:

[0052] Low coding delay: Since no frame buffering is needed, the codec provides the lowest possible coding delay. As soon as a frame arrives from the capture device, it is coded immediately.

[0053] Low computational complexity: The encoder and decoder of the codec have low and approximately symmetric computational complexity. This is because the method of the invention uses a simpler process of motion detection than the traditional process of motion estimation and compensation. The encoder has lower complexity than block-based motion estimation/compensation schemes by an order of magnitude. One implementation of the codec provides about 40 frames/second with a frame size of 176×144 pixels and with both the encoder and decoder executing on a 667 MHz Pentium III computer with a Windows 2000 operating system.

[0054] Scalable coding: The codec provides temporal, resolution, and SNR scalability. This enables a distributed video application such as video conferencing to dynamically adjust video resolution, frame rate, and picture quality of the received video depending on the available network bandwidth and hardware capabilities of the receiver. Receivers with high bandwidth network connections and high-performance computers may receive high quality color video at high frame rate while receivers with low bandwidth connections and low-performance computers may receive lower quality video at a lower frame rate.

[0055] Error resilience: The bitstream of the codec provides a high degree of error resilience for networks with packet loss such as the Internet. The encoding technique limits the effect of channel errors to the smallest possible temporal and spatial neighborhood. Undesirable temporal and spatial error propagation is prevented by using intraupdate and/or differential update instead of motion estimation/compensation. Furthermore, the applied coefficients reorganization and distributed forced intraupdate significantly increase the error resilience of the codec.

[0056] High coding performance: High coding performance is needed in order to provide users with high video quality over low bandwidth network connections. A more efficient compression algorithm means that users receive higher picture quality for a given bit rate. The codec provides similar visual quality when compared with H.263 for low motion scenes, and a small compromise in visual quality for high motion scenes (due to the lack of motion estimation and compensation). However, the performance compromise for high motion scenes is more than justified by the low coding delay, low computational complexity, scalable bitstream, and high error resilience.

[0057] As will be apparent to those skilled in the art in the light of the foregoing disclosure, many alterations and modifications are possible in the practice of this invention without departing from the scope thereof. Accordingly, the scope of the invention is to be construed in accordance with the substance defined by the following claims.

What is claimed is:
 1. A method of encoding an input video signal comprising a group of video frames for communication over a computer network, the method comprising the steps of: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation (DCT) to coded macroblocks to produce DCT coefficients; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by a) encoding the motion information; b) encoding the significance map; c) encoding the signs of all significant coefficients; and d) encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.
 2. The method of claim 1 wherein said previous frame is the previous original frame.
 3. The method of claim 2 wherein the motion detection and coding step is such that: i) if the motion detected does not exceed a threshold, then said macroblock is not coded; and ii) if the motion detected exceeds said threshold, then said macroblock is intracoded.
 4. The method of claim 1 further comprising the step of reconstructing each frame from the motion information and significance map, and wherein said previous frame is the previous reconstructed frame.
 5. The method of claim 4 wherein the motion detection and coding step is such that: i) if the motion detected does not exceed a first threshold, then said macroblock is not coded; ii) if the motion detected exceeds said first threshold but does not exceed a second threshold, then the difference between said current frame and said previous frame is coded; and iii) if the motion detected exceeds both first and second thresholds, then said macroblock is intracoded.
 6. The method of claim 1 further comprising the step, regardless of whether or not motion is detected, of periodically intracoding portions of each frame in a distributed manner.
 7. A computer program product for encoding an input video signal comprising a group of video frames for communication over a computer network, said computer program product comprising: (A) a computer usable medium having computer readable program code means embodied in said medium for: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation (DCT) to coded macroblocks to produce DCT coefficients; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by a) encoding the motion information; b) encoding the significance map; c) encoding the signs of all significant coefficients; and d) encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.
 8. The computer program product of claim 7 wherein said previous frame is the previous original frame.
 9. The computer program product of claim 8 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a threshold, then said macroblock is not coded; and ii) if the motion detected exceeds said threshold, then said macroblock is intracoded.
 10. The computer program product of claim 7 wherein said computer usable medium further has computer readable code means embodied in same medium for reconstructing each frame from the motion information and significance map, and wherein said previous frame is the previous reconstructed frame.
 11. The computer program product of claim 10 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a first threshold, then said macroblock is not coded; ii) if the motion detected exceeds said first threshold but does not exceed a second threshold, then the difference between said current frame and said previous frame is coded; and iii) if the motion detected exceeds both first and second thresholds, then said macroblock is intracoded.
 12. The computer program product of claim 7 wherein said computer usable medium further has computer program code means for, regardless of whether or not motion is detected, periodically intracoding portions of each frame in a distributed manner.
 13. An article comprising: (A) a computer readable modulated carrier signal; (B) means embedded in said signal for encoding an input video signal comprising a group of video frames for communication over a computer network, said means comprising means for: i) dividing each frame into a two-dimensional array of macroblocks; ii) detecting motion between each macroblock of a current frame and the corresponding macroblock of a previous frame, and coding only those macroblocks where motion is detected; iii) replacing all coefficients of non-coded macroblocks with zero coefficients; iv) applying discrete cosine transformation (DCT) to coded macroblocks to produce DCT coefficients; v) reorganizing coefficients into a multi-resolution representation; vi) quantizing the coefficients with a uniform scalar quantizer to produce a significance map; and vii) adaptive arithmetic coding of said signal by a) encoding the motion information; b) encoding the significance map; c) encoding the signs of all significant coefficients; and d) encoding the magnitudes of significant coefficients, in bit-plane order starting with the most significant bit.
 14. The article of claim 13 wherein said previous frame is the previous original frame.
 15. The article of claim 14 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a threshold, then said macroblock is not coded; and ii) if the motion detected exceeds said threshold, then said macroblock is intracoded.
 16. The article of claim 13 wherein said embedded means further comprises means for reconstructing each frame from the motion information and significance map, and wherein said previous frame is the previous reconstructed frame.
 17. The article of claim 16 wherein the motion detection and coding means are such that: i) if the motion detected does not exceed a first threshold, then said macroblock is not coded; ii) if the motion detected exceeds said first threshold but does not exceed a second threshold, then the difference between said current frame and said previous frame is coded; and iii) if the motion detected exceeds both first and second thresholds, then said macroblock is intracoded.
 18. The article of claim 13 wherein said embedded means further comprises means for, regardless of whether or not motion is detected, periodically intracoding portions of each frame in a distributed manner.